Red Wine Quality Analysis by Michael Eckstein


Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## [1] 1599
## [1] 1599   13
## alcohol :  num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## chlorides :  num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## citric.acid :  num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## density :  num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## fixed.acidity :  num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## free.sulfur.dioxide :  num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## pH :  num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## quality :  int [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## residual.sugar :  num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## sulphates :  num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## total.sulfur.dioxide :  num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## volatile.acidity :  num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## X :  int [1:1599] 1 2 3 4 5 6 7 8 9 10 ...
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## Warning: position_stack requires constant width: output may be incorrect
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 12 features (residual.sugar, density, quality, fixed.acidity, chlorides, pH, volatile.acidity, free.sulfur.dioxide, sulphates, citric.acid, total.sulfur.dioxide, alcohol), plus a row index column, X. None of the variables is an ordered factor; all are numeric or integer values.

Other observations:
- The median quality is 6, with a minimum of 3 and a maximum of 8 on a scale of 0-10.
- The number of wines at each quality level is: 3 (10), 4 (53), 5 (681), 6 (638), 7 (199), 8 (18).
- The alcohol content of the red wines ranges from 8.4% to 14.9%, with 75% of the wines at or below 11.1%.
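As a quick cross-check of these figures, the quality column can be rebuilt from the frequency table printed in the univariate output. This is only a sketch in base R: the variable names (`quality_counts`, `share_5_or_6`, and so on) are illustrative and do not come from the original analysis code.

```r
# Rebuild the quality column from the frequency table printed above.
# Names are the quality scores, values are the number of wines at each score.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

n_wines <- length(quality)                  # total number of wines
quality_median <- median(quality)           # middle rating
quality_mean <- mean(quality)               # average rating
share_5_or_6 <- mean(quality %in% c(5, 6))  # fraction of wines rated 5 or 6

print(table(quality))
cat("n =", n_wines, " median =", quality_median,
    " mean =", round(quality_mean, 3), "\n")
cat("share rated 5 or 6 =", round(share_5_or_6, 3), "\n")
```

The rebuilt vector has 1599 entries and gives the same median (6) and mean (5.636) as the summary output. It also shows how concentrated the ratings are: roughly four out of five wines are rated 5 or 6.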

What is/are the main feature(s) of interest in your dataset?

The main feature of the data set is quality. I’d like to determine which features have the greatest impact on the quality of red wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Fixed.acidity, volatile.acidity, citric.acid, chlorides, total.sulfur.dioxide, density, sulphates, and alcohol are likely to contribute to the quality of red wine.

Did you create any new variables from existing variables in the dataset?

No, I did not create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Fixed.acidity, volatile.acidity, density, pH, alcohol, and quality are close to normally distributed. Residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, and sulphates are skewed to the right (positively skewed): each has a long upper tail and a mean above its median. Citric.acid is somewhat evenly distributed but has a large number of values at 0 (132 in total). The five right-skewed features also have outliers that could affect the analysis, so I applied a log10 transformation to them.
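A minimal sketch of the skew check and the log10 transformation, in base R. It uses only the first ten residual.sugar values shown in the str() output, so the numbers illustrate the idea and will not match the full 1599-row column. The helper `relative_gap` is a hypothetical name introduced here, not part of the original analysis.

```r
# First ten residual.sugar values, copied from the str() output above.
# Ten rows are only a demonstration; the full column has 1599 values.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# A mean well above the median is a quick sign of a long right tail.
raw_gap <- mean(residual_sugar) - median(residual_sugar)

# log10 compresses large values more than small ones.
log_sugar <- log10(residual_sugar)

# Hypothetical helper: mean-median gap expressed in standard deviations,
# so the raw and transformed scales can be compared.
relative_gap <- function(x) (mean(x) - median(x)) / sd(x)

cat("raw mean - median:", round(raw_gap, 3), "\n")
cat("relative gap, raw:  ", round(relative_gap(residual_sugar), 3), "\n")
cat("relative gap, log10:", round(relative_gap(log_sugar), 3), "\n")
```

In this small sample the single large value (6.1) pulls the mean above the median, and the gap shrinks relative to the spread after the transformation. The same comparison on the full column would need the complete dataset.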

Bivariate Plots Section

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000
## 
##  Pearson's product-moment correlation
## 
## data:  quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  quality and volatile.acidity
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(residual.sugar)
## t = 0.9407, df = 1597, p-value = 0.347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02551727  0.07247084
## sample estimates:
##        cor 
## 0.02353331
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(chlorides)
## t = -7.1508, df = 1597, p-value = 1.308e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2232336 -0.1282260
## sample estimates:
##      cor 
## -0.17614
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(free.sulfur.dioxide)
## t = -2.0041, df = 1597, p-value = 0.04522
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.098865884 -0.001068979
## sample estimates:
##         cor 
## -0.05008749
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(total.sulfur.dioxide)
## t = -6.8999, df = 1597, p-value = 7.476e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2173510 -0.1221403
## sample estimates:
##        cor 
## -0.1701427
## 
##  Pearson's product-moment correlation
## 
## data:  quality and density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192
## 
##  Pearson's product-moment correlation
## 
## data:  quality and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## 
##  Pearson's product-moment correlation
## 
## data:  quality and log10(sulphates)
## t = 12.9672, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2636092 0.3523323
## sample estimates:
##       cor 
## 0.3086419
## 
##  Pearson's product-moment correlation
## 
## data:  quality and alcohol
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and citric.acid
## t = -26.4891, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality has its strongest correlation with alcohol (r = 0.48), weaker positive correlations with sulphates (r = 0.25, or 0.31 after the log10 transformation) and citric.acid (r = 0.23), and a negative correlation with volatile.acidity (r = -0.39). These relationships make sense given what each attribute measures. Volatile.acidity is the amount of acetic acid in the wine, and a higher value means more of an unpleasant, vinegar taste. Citric.acid can add freshness and flavor to wines. Sulphates can help keep wine fresh. Total.sulfur.dioxide has a weaker relationship with quality (r = -0.19), since low amounts are prevalent in both lower- and higher-quality wines whereas higher amounts occur in mid-quality wines. Sulfur dioxide (SO2) becomes evident in the nose and taste of wine at concentrations over 50 ppm, which may be why wines with higher amounts are rated at the mid-level.
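The t statistic and confidence interval reported for quality and alcohol can be reproduced from the correlation coefficient and sample size alone. The sketch below uses the standard formulas (t from r, and a Fisher z interval); the variable names are illustrative.

```r
r <- 0.4761663   # quality vs. alcohol, from the cor.test output above
n <- 1599        # number of wines
df <- n - 2      # degrees of freedom for the Pearson test

# Test statistic for H0: true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Share of variance in quality that a straight-line fit on alcohol would explain.
r_squared <- r^2

cat("t =", round(t_stat, 4), " df =", df, "\n")
cat("95% CI:", round(ci, 4), "\n")
cat("r^2 =", round(r_squared, 3), "\n")
```

The recomputed t value and interval agree with the cor.test output (t = 21.6395, interval 0.437 to 0.513). The r^2 of about 0.23 is a reminder that alcohol on its own accounts for under a quarter of the variation in quality.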

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Among the remaining features, quality has smaller correlations with fixed.acidity (r = 0.12), chlorides (r = -0.13), and density (r = -0.17). Citric.acid has a strong positive relationship with fixed.acidity (r = 0.67) and a negative relationship with volatile.acidity (r = -0.55). Since all of these are acids, they affect the pH of the wine. Density depends on the alcohol and sugar content of the wine. Free.sulfur.dioxide is also strongly related to total.sulfur.dioxide (r = 0.67).

What was the strongest relationship you found?

The strongest relationship with quality is alcohol (r = 0.48). Among the other features, the strongest is fixed.acidity to pH (r = -0.68), followed by fixed.acidity to citric.acid and fixed.acidity to density (both r = 0.67).
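One way to pick out the strongest pair programmatically is to search the correlation matrix for the largest absolute off-diagonal value. The sketch below hard-codes a 4 x 4 subset of the matrix printed in the Bivariate Plots Section. The subset and the variable names (`cor_subset`, `pairs_only`) are choices made for this illustration.

```r
# Subset of the correlation matrix printed above (values copied verbatim).
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
cor_subset <- matrix(
  c( 1.00000000,  0.67170343,  0.66804729, -0.68297819,
     0.67170343,  1.00000000,  0.36494718, -0.54190414,
     0.66804729,  0.36494718,  1.00000000, -0.34169933,
    -0.68297819, -0.54190414, -0.34169933,  1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by blanking the diagonal and the lower triangle.
pairs_only <- cor_subset
pairs_only[lower.tri(pairs_only, diag = TRUE)] <- NA

# Position of the largest absolute correlation among the remaining pairs.
idx <- which(abs(pairs_only) == max(abs(pairs_only), na.rm = TRUE),
             arr.ind = TRUE)
strongest_pair <- c(rownames(pairs_only)[idx[1, "row"]],
                    colnames(pairs_only)[idx[1, "col"]])
strongest_r <- pairs_only[idx[1, "row"], idx[1, "col"]]

cat(strongest_pair[1], "vs", strongest_pair[2], ": r =", strongest_r, "\n")
```

On this subset the search returns fixed.acidity and pH. With the full data frame loaded, the same steps could start from the output of `cor()` instead of a hand-typed matrix.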

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Exploring alcohol, sulphates, citric.acid, volatile.acidity, and total.sulfur.dioxide levels and their impact on quality, I was able to show a relationship between quality and alcohol, acidity (higher citric.acid and lower volatile.acidity), and sulphates. Alcohol clearly has the largest impact, but sulphates and citric.acid also show an interesting relationship, since the higher quality wines were plotted in the upper right of the graphs for those features. On the other hand, due to the negative correlation, the higher quality wines in the graph of volatile.acidity and alcohol were plotted in the lower right.

Were there any interesting or surprising interactions between features?

The relationship between a lower volatile.acidity and higher citric.acid is more evident in the diagram. This supports the idea that citric.acid adds flavor and freshness while volatile.acidity negatively impacts the flavor.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

By plotting the quality of the wine against the alcohol content, we can clearly see the relationship between the two. As the alcohol content increases, the quality of the wine also tends to increase in a near-linear fashion. There is some overlap in alcohol content between the different quality ratings, which means that alcohol is important but that a combination of other factors also plays a role in the quality of a wine.

Plot Two

Description Two

We were able to determine that a positive correlation exists between alcohol and quality (r = 0.48) and a negative correlation exists between quality and volatile.acidity (r = -0.39). This graph shows that relationship, where a higher alcohol content and lower volatile.acidity are associated with a higher quality wine (better wines in lower left of graph).

Plot Three

Description Three

This boxplot demonstrates the relationship between citric.acid and the quality of wine. We also see only one outlier in the dataset. Plotting the median along with the boxplot shows citric.acid increasing along with quality. As a result, we can say that the higher the citric.acid level in a wine, the better its quality rating tends to be.


Reflection

I was able to investigate the different features of the data set and perform an analysis to determine which had the greatest impact on quality. The features that factored into quality the most were alcohol content, sulphates, and acidity (citric.acid and volatile.acidity). The correlations and graphs illustrated the relationships between these features and the trends that resulted from increasing or decreasing the amount of each in wine. Although there may be some variation, the highest quality wines were higher in alcohol content, sulphates, and citric.acid while having a lower volatile.acidity. These were the freshest, best tasting wines, the ones most favored by the experts rating them. The analysis could be enriched by performing a more in-depth comparison of the relationships between all of the different features. I only looked at the top 5 correlations, so a broader comparison could provide more information to support the conclusion. I also could have accounted for the entries of 0 for citric.acid (or other data quality issues). Overall, the analysis provided a great opportunity to explore the data set and reinforce the skills learned through the lessons.